This script takes ZIP-based alignment files and creates a high-level national map. It was developed for 36 sales regions, but it could be useful for any national-level alignment with anywhere from 2 to 100 regions or territories.
Before starting, you just need a file that has:

- ZIP Code
- Alignment Category (e.g., Sales Region, Customer Care Region)
- Importance Variable (you can default this to 1)
The Importance Variable is used to weight a ZIP code’s importance for a region. A ZIP code in downtown Chicago in a highly populated area probably holds more business importance than a rural, sparsely populated ZIP Code 300 miles outside of Chicago. If you want all ZIP codes to be treated equally, just set the Importance Variable to 1 for all ZIP codes. In this script, we are using 2019 Video Procedure volumes per ZIP code as the Importance Variable.
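For reference, a minimal input file might look like this (the values here are hypothetical; the script renames the columns on import):

```
ZIP,Importance,Hybrid_Region
60601,1200,Great Lakes
60602,850,Great Lakes
35004,1,Southern
```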
Importing Data
The first part of the script is importing data and software libraries.
We get our shapefiles from the US Census website. We are using state shapes, congressional district shapes and ZIP shapes. We could also use counties for more granularity, but at the national level the congressional districts should do fine.
The sf library in R is used for handling shapefiles. The tmap, rmapshaper and smoothr libraries are all map-specific. The wesanderson library gives us color palettes.
We also read in our alignment file and assign it to a dataframe called zip_align. We merge that by ZIP to the ZIP code shape file.
Let’s print the state map to get our bearings on the shape file.
library(sf)
## Linking to GEOS 3.8.0, GDAL 3.0.4, PROJ 6.3.1
library(tidyverse)
## Warning: package 'tidyverse' was built under R version 3.5.3
## -- Attaching packages --------------------------------------------------------------------------------- tidyverse 1.3.0 --
## v ggplot2 3.2.1 v purrr 0.3.3
## v tibble 3.0.3 v dplyr 0.8.4
## v tidyr 1.0.2 v stringr 1.4.0
## v readr 1.3.1 v forcats 0.4.0
## Warning: package 'ggplot2' was built under R version 3.5.3
## Warning: package 'tidyr' was built under R version 3.5.3
## Warning: package 'readr' was built under R version 3.5.3
## Warning: package 'purrr' was built under R version 3.5.3
## Warning: package 'dplyr' was built under R version 3.5.3
## Warning: package 'stringr' was built under R version 3.5.3
## Warning: package 'forcats' was built under R version 3.5.3
## -- Conflicts ------------------------------------------------------------------------------------ tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
library(smoothr)
## Warning: package 'smoothr' was built under R version 3.5.3
##
## Attaching package: 'smoothr'
## The following object is masked from 'package:stats':
##
## smooth
library(randomForest)
## Warning: package 'randomForest' was built under R version 3.5.3
## randomForest 4.6-14
## Type rfNews() to see new features/changes/bug fixes.
##
## Attaching package: 'randomForest'
## The following object is masked from 'package:dplyr':
##
## combine
## The following object is masked from 'package:ggplot2':
##
## margin
library(wesanderson)
## Warning: package 'wesanderson' was built under R version 3.5.3
library(tmap)
library(rmapshaper)
## Warning: package 'rmapshaper' was built under R version 3.5.3
zip_shapes <- st_read("C:/Users/nswitzer/Desktop/Shape Files/tl_2019_us_zcta510")
## Reading layer `tl_2019_us_zcta510' from data source `C:\Users\nswitzer\Desktop\Shape Files\tl_2019_us_zcta510' using driver `ESRI Shapefile'
## Simple feature collection with 33144 features and 9 fields
## geometry type: MULTIPOLYGON
## dimension: XY
## bbox: xmin: -176.6847 ymin: -14.37378 xmax: 145.8305 ymax: 71.34132
## geographic CRS: NAD83
state_shapes <- st_read("C:/Users/nswitzer/Desktop/Shape Files/State Shape Files")
## Reading layer `cb_2018_us_state_500k' from data source `C:\Users\nswitzer\Desktop\Shape Files\State Shape Files' using driver `ESRI Shapefile'
## Simple feature collection with 56 features and 9 fields
## geometry type: MULTIPOLYGON
## dimension: XY
## bbox: xmin: -179.1489 ymin: -14.5487 xmax: 179.7785 ymax: 71.36516
## geographic CRS: NAD83
district_shapes <- st_read("C:/Users/nswitzer/Desktop/Shape Files/Congressional District Shape Files")
## Reading layer `cb_2018_us_cd116_500k' from data source `C:\Users\nswitzer\Desktop\Shape Files\Congressional District Shape Files' using driver `ESRI Shapefile'
## Simple feature collection with 441 features and 8 fields
## geometry type: MULTIPOLYGON
## dimension: XY
## bbox: xmin: -179.1489 ymin: -14.5487 xmax: 179.7785 ymax: 71.36516
## geographic CRS: NAD83
##Read in Alignment file
#zip_align <- read.csv("C:/Users/nswitzer/Desktop/MarketShare Refresh/Endo ZIP Alignments.csv", colClasses = "character" )
zip_align <- read.csv("C:/Users/nswitzer/Desktop/Full Nation ZIP procedures.csv", colClasses = "character")
names(zip_align) <- c("ZIP", "Importance", "Hybrid_Region")
full_align <- merge(zip_align, zip_shapes, by.x ="ZIP", by.y = "ZCTA5CE10", all.x = TRUE)
st_geometry(full_align) <- full_align$geometry
ggplot(state_shapes) + geom_sf()
Nailed it! So comprehensive, we even have Guam and American Samoa.
Cropping the continental US
Let’s focus on just the continental US for now. This code creates a filter box, then uses st_crop to reduce our state and district files to just the region we need.
filter_box <- st_bbox(c(xmin = -126, xmax = -67, ymin = 25, ymax = 50), crs = st_crs("NAD83"))
district_shapes <- st_crop(district_shapes, filter_box)
## although coordinates are longitude/latitude, st_intersection assumes that they are planar
## Warning: attribute variables are assumed to be spatially constant
## throughout all geometries
state_shapes <- st_crop(state_shapes, filter_box)
## although coordinates are longitude/latitude, st_intersection assumes that they are planar
## Warning: attribute variables are assumed to be spatially constant
## throughout all geometries
ggplot(district_shapes) + geom_sf()
Relating States to ZIPs
This is where the magic of spatial data analysis happens. With normal non-spatial, attribute-driven data, relating ZIP code alignments to congressional districts and states would require merges through a ZIP-to-state lookup table, using ZIP code as the key. While that works, we can skip that import step entirely by using a spatial join instead.
The spatial join relates the ZIP shapes to the state shapes using their overlapping geography. This is the st_join function from the sf package.
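For contrast, the attribute-driven route would look something like this (the crosswalk file and its column names are hypothetical):

```r
# Non-spatial alternative: merge through a ZIP-to-state crosswalk table.
# This requires sourcing and maintaining an extra lookup file.
crosswalk <- read.csv("zip_to_state_crosswalk.csv", colClasses = "character")  # columns: ZIP, STATEFP
align_by_state <- merge(zip_align, crosswalk, by = "ZIP", all.x = TRUE)
# The spatial join needs no such lookup; overlapping geography is the key.
```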
We will run through a sample.
First, we will make a smaller dataset with only Alabama. We will then use st_join to find the appropriate aligned region from our ZIP alignments.
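The single-state subset isn’t shown above, so here is a sketch of how it might be built (assuming the NAME column from the Census state shapefile):

```r
# Keep just Alabama from the state shapefile and plot it to check.
single_dist <- subset(state_shapes, NAME == "Alabama")
ggplot(single_dist) + geom_sf()
```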
There it is. Alabama. Home of the Crimson Tide, Jacob Nail’s favorite football team and the second coolest collegiate program to use an elephant mascot.
Now we use st_join to join this single geography (Alabama) to the overlapping zip_alignment file. In this case, we will only select the largest ZIP overlap with Alabama. If we did not set largest = TRUE in the st_join call, then this would return every ZIP code that overlaps with Alabama. We will use that later.
Let’s see if this aligns Hybrid_Region correctly by printing out the Alabama geography object joined to the ZIP Alignment file. This merge object is called ‘joined_one’.
start_time <- Sys.time()
joined_one <- st_join(single_dist, full_align, largest = T)
## although coordinates are longitude/latitude, st_intersection assumes that they are planar
## Warning: attribute variables are assumed to be spatially constant
## throughout all geometries
end_time <- Sys.time()
time_taken <- end_time - start_time
joined_one
There it is - Hybrid Region is Southern. This appears to work.
Moving forward, however, we will use the Importance column to align geographies rather than the largest overlap.
State-wise Regions
In this section, we categorize the states that are made up almost entirely of one region. This will reduce the complexity of our work later when we go district by district or ZIP code by ZIP code. It mostly helps in geographically large states with geographically small but highly concentrated population centers (e.g., Salt Lake City, Utah).
This top-down, state-based method is best for the large approximate map focused on aesthetics. It fits with how people are used to understanding the country.
At a high level, we are going to simplify the regional boundaries to state lines wherever possible. Some states are filled by only one region (e.g., Maine/New England, Oklahoma/Great Plains). Using the Importance variable as a filter, we will make a simple dataset of state geographies and regions. Then, for the states that need to be broken down further, we use congressional districts.
Recipe:

1. Fill single-region states with region data.
2. For states with multiple regions, break the state down into congressional districts and assign those districts based on ZIP codes.
3. Join the states dataset and the congressional district dataset into one dataset.
4. Identify any congressional districts that we want to manually overwrite.
5. Dissolve the internal state and congressional district boundaries to make regional boundaries.
6. Smooth and simplify.
The first step is to join the states to the ZIPs to get possible alignments. We use the spatial join again, but this time every ZIP that overlaps a state gets a row in the dataframe.
start_time <- Sys.time()
state_shapes_merged <- st_join(state_shapes, full_align[!is.na(full_align$Hybrid_Region),],left = T)
## although coordinates are longitude/latitude, st_intersects assumes that they are planar
end_time <- Sys.time()
#district_shapes$nearest_region <- full_align$Hybrid_Region[district_shapes$nearest_region_index]
#str(district_shapes_merged)
#print(end_time-start_time)
head(state_shapes_merged)
Each line above carries the full geometry for its state. Of course, for each state we only need one geometry, so we need to reduce this.
Now we need to see the “Importance” variable for each state-region combination. For this, we will aggregate the data using group_by and summarise* from the tidyverse package. (*The tidyverse was mainly created by Hadley Wickham, a Kiwi. As a citizen of a Commonwealth country, he uses the British versions of many words.)
#Runs forever
#tm_shape(district_shapes_merged) + tm_polygons(district_shapes_merged, id ="Hybrid_Region") + tm_fill("Hybrid_Region", palette = wes_palette("Moonrise1", 3, type = "continuous"))
tmap_mode("view")
## tmap mode set to interactive viewing
#tm_shape(district_shapes_merged) + tm_fill("Hybrid_Region", labels = "Hybrid_Region", palette = wes_palette("Moonrise1", 3, type = "continuous"))
system.time(agg_by_state <- state_shapes_merged %>%
group_by(STATEFP, Hybrid_Region) %>%
summarise(importance = sum(as.numeric(Importance)), GEOID = first(GEOID)) %>%
arrange(desc(importance)))
## user system elapsed
## 3.08 0.10 3.17
head(agg_by_state)
The agg_by_state dataset now has unique Region and State relationships with cumulative importance values.
For each state, we need to see what relative importance each region has. If the relative importance of a region is over a threshold, we will assign that entire state to one region. This threshold is an importance ratio of 75% for now, but check the code.
agg_by_state$importance_in_state <- sapply(agg_by_state$STATEFP, function(a) sum(agg_by_state$importance[agg_by_state$STATEFP == a]))
agg_by_state$importance_ratio <- agg_by_state$importance / agg_by_state$importance_in_state
agg_by_state <- subset(agg_by_state, importance_ratio > .75)
#How many states are mostly covered
#for (i in seq(.1,1,.1)){
# print(i)
# print(length(agg_by_state$STATEFP[agg_by_state$importance_ratio> i]))
#}
region_count <- length(unique(state_shapes_merged$Hybrid_Region))
tmap_options(max.categories = region_count)
#tm_shape(agg_by_state) + tm_fill("Hybrid_Region")
ggplot(agg_by_state, aes(fill = Hybrid_Region)) + geom_sf() + theme(legend.position = "none") + ggtitle("States fully assigned to regions")
Alright! These states are fully assigned to a region so we do not have to break them down into congressional districts. We can put that dataset aside for a while.
Now that we have chunked out the states with a vast majority of their markets overlapping with one region, we can get into breaking down the states with more nuanced breakdowns.
Congressional districts
First, we filter the district shapes to only the districts in states that are not fully taken by one region. Then we use a spatial join to assign them a region.
filtered_district_shapes <- subset(district_shapes, !(STATEFP %in% agg_by_state$STATEFP) )
district_shapes_merged <- st_join(filtered_district_shapes, full_align[!is.na(full_align$Hybrid_Region),], left = T, largest = T)
## although coordinates are longitude/latitude, st_intersection assumes that they are planar
## Warning: attribute variables are assumed to be spatially constant
## throughout all geometries
#Using Importance
#system.time(agg_by_district <- district_shapes_merged %>%
# group_by(GEOID, Hybrid_Region) %>%
# summarise(importance = max(sum(as.numeric(Importance))), STATEFP = first(STATEFP)) %>%
# arrange(desc(importance)))
agg_by_district <- district_shapes_merged
ggplot(agg_by_district, aes(fill = Hybrid_Region)) + geom_sf() + theme(legend.position = "none") + ggtitle("Congressional Districts Aligned by Largest ZIP")
#ggplot(district_shapes_merged, aes(fill = Hybrid_Region)) + geom_sf() + theme(legend.position = "none") + ggtitle("Congressional Districts Aligned by Largest ZIP")
##tm_shape(district_shapes_merged) + tm_fill("Hybrid_Region")
Yes! We successfully added alignments to the districts. Now let’s combine these with the state assignments from earlier using a row bind. To do this, we will need to make both data frames have the same names.
We’ll also use this map for identifying necessary overwrites, so we’ll print GEOID on each state/district shape. Also, we will use tmap to make an interactive map.
agg_by_district$importance_in_state <- NA
agg_by_district$importance_ratio <- NA
agg_by_district$importance <- NA
shapes_and_districts <- rbind(agg_by_state, agg_by_district[, names(agg_by_district) %in% names(agg_by_state)])
tm_shape(shapes_and_districts) + tm_fill("Hybrid_Region") + tm_text("GEOID")
The above map is hairy, but I can use it to quickly identify the GEOIDs I want to reassign. West Idaho should be showing up for Pacific Mountain, but for some reason it’s Northwest. If we were going for an accurate representation, then I should go deeper into the source data. However, we are going for a high level map.
In this case, I am going to hard-code any overrides, but we could easily write them to a file and read from that if it would make the process more sustainable.
shapes_and_districts$Hybrid_Region[shapes_and_districts$GEOID == "1601"] <- "Pacific Mountains"
shapes_and_districts$Hybrid_Region[shapes_and_districts$GEOID == "5402"] <- "Steel Cities"
#shapes_and_districts$Hybrid_Region[shapes_and_districts$GEOID == "4504"] <- "Tar Heel"
shapes_and_districts$Hybrid_Region[shapes_and_districts$GEOID == "1302"] <- "South Atlantic"
shapes_and_districts$Hybrid_Region[shapes_and_districts$GEOID == "5502"] <- "Northwoods - NS"
shapes_and_districts$Hybrid_Region[shapes_and_districts$GEOID == "3502"] <- "Southwest - SW"
tm_shape(shapes_and_districts) + tm_fill("Hybrid_Region")
Dissolving Boundaries
Sounds so peaceful.
Now let’s get rid of the state and district boundaries (dissolve them) with a grouped summary, so that only the regions have borders.
system.time(hybrid_region_align <-shapes_and_districts %>%
group_by(Hybrid_Region) %>%
summarise(area = sum(as.numeric(importance))))
## user system elapsed
## 2.52 0.02 2.53
#str(hybrid_region_align)
pal <- wes_palette("Darjeeling1", region_count, type = "continuous")
ggplot(hybrid_region_align, aes(fill=Hybrid_Region)) + geom_sf() + ylim(24,51) + xlim(-129,-65) + theme(legend.position = "none")
#tm_shape(hybrid_region_align) + tm_fill("Hybrid_Region")
If I squint, that looks like the USA, but the outlying territories make it hard to focus on the lower 48. We could use limits to zoom, but in this case I’ll use drop_crumbs from the smoothr package to get rid of smaller areas like islands.
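A sketch of that cleanup step, reusing the same 10,000 km^2 threshold idea (the threshold is illustrative; tune it to taste):

```r
# Drop polygons smaller than the threshold (small islands and slivers).
crumb_thresh <- units::set_units(10000, km^2)
hybrid_region_align <- drop_crumbs(hybrid_region_align, crumb_thresh)
```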
There seem to be small gaps. Let’s use the fill_holes tool to address that.
area_thresh <- units::set_units(10000, km^2)
hybrid_region_align <- fill_holes(hybrid_region_align, area_thresh)
ggplot(hybrid_region_align, aes(fill=Hybrid_Region)) + geom_sf() + ylim(24,51) + xlim(-129,-65) + theme(legend.position = "none")
Looking good, let’s try to simplify some of those lines.
hybrid_region_align_simplify <- st_simplify(hybrid_region_align, preserveTopology = T,dTolerance = .2)
## Warning in st_simplify.sfc(st_geometry(x), preserveTopology, dTolerance):
## st_simplify does not correctly simplify longitude/latitude data, dTolerance
## needs to be in decimal degrees
ggplot(hybrid_region_align_simplify, aes(fill=Hybrid_Region)) + geom_sf() + ylim(24,51) + xlim(-129,-65) + theme(legend.position = "none")
hybrid_region_ms <- fill_holes(smooth(rmapshaper::ms_simplify(hybrid_region_align, keep = 0.03, keep_shapes = TRUE), method = "spline"), area_thresh)
#hybrid_region_ms <- rmapshaper::ms_simplify(hybrid_region_align, keep = 0.01, keep_shapes = TRUE)
ggplot(hybrid_region_ms, aes(fill=Hybrid_Region)) + geom_sf() + ylim(24,51) + xlim(-129,-65) + theme(legend.position = "none")
amap <- tm_shape(hybrid_region_ms) + tm_fill("Hybrid_Region")#, palette = wes_palette("Moonrise1", region_count, type = "continuous")) ##, palette = wes_palette("Darjeeling1", region_count, type = "continuous"))
tmap_save(amap, "SampleMap.html")
## Interactive map saved to C:\Users\nswitzer\Stryker\QuarterBack - Endo Quota Setting - PBI Files\SampleMap.html
tmap_save(amap, "C:/Users/nswitzer/Documents/GitHub/switzloco.github.io/index.html")
## Interactive map saved to C:\Users\nswitzer\Documents\GitHub\switzloco.github.io\index.html
These lines write the shapes out to several different file types.
#glimpse(hybrid_region_ms)
#hybrid_region_ms_dc <- hybrid_region_ms[, !grepl("area", names(hybrid_region_ms))]
#st_write(hybrid_region_ms, "Hybrid Regions - Simplified - Fewer Holes.shp")
#st_write(hybrid_region_ms, "Hybrid Regions - Simplified.geojson")
#st_write(hybrid_region_align, "Hybrid Regions Shape.shp")
#st_write(hybrid_region_align, "Hybrid Regions Shape.sqlite")
#st_write(hybrid_region_ms, "Hybrid Regions Shape4.csv")
That’s it for now. Let’s review and discuss. Thanks!